Kaggle Competition https://www.kaggle.com/c/web-traffic-time-series-forecasting
Using Kaggle competition as:
Data sets are available here for download:
https://www.kaggle.com/c/web-traffic-time-series-forecasting/data
Information based on full dataset for reference.
train:
key:
Due to the size of the data, take a sample and work with that.
Sample can be based on:
First crack:
Error in eval(expr, envir, enclos) : object 'train1' not found
Second crack:
54KB - much better for a small computer
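The sampling code itself is not shown here. A minimal sketch of one way to get a small sample like this, assuming the sample is chosen by matching page names: `train` below is a toy stand-in for the full table with made-up values, and `topics`, `pattern`, `subject_sample` and `out_file` are hypothetical names.

```r
## toy stand-in for the full training table (made-up values, not real data)
train <- data.frame(
  Page = c("Main_Page_en.wikipedia.org_desktop_all-agents",
           "Howard_Hughes_en.wikipedia.org_all-access_spider",
           "Some_Other_Page_de.wikipedia.org_desktop_all-agents"),
  X2015.07.01 = c(100L, 5L, 42L),
  X2015.07.02 = c(110L, 7L, 40L),
  stringsAsFactors = FALSE
)

## assumption: keep only the English Wikipedia rows for a few chosen topics
topics <- c("Main_Page", "Howard_Hughes", "Orange_Is_the_New_Black")
pattern <- paste0("^(", paste(topics, collapse = "|"), ")_en\\.wikipedia\\.org_")
subject_sample <- train[grepl(pattern, train$Page), ]

## save the small sample so later sessions can skip the full file
out_file <- file.path(tempdir(), "subject.csv")
write.csv(subject_sample, out_file, row.names = FALSE)
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
```

With the real data, `train` would come from the downloaded competition file and `out_file` would point at the project's input folder.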
SAVE AND USE
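## packages used below: dplyr/tidyr (wrangling), ggplot2/scales/plotly (charts), lubridate (dates)
library(dplyr); library(tidyr); library(ggplot2); library(scales); library(plotly); library(lubridate)
subject <- read.csv("data-input/subject.csv", stringsAsFactors=FALSE)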
Structure (extend for more date columns):
subject.temp <- subject[,c(1:3)]
str(subject.temp)
'data.frame': 12 obs. of 3 variables:
$ Page : chr "Howard_Hughes_en.wikipedia.org_desktop_all-agents" "Main_Page_en.wikipedia.org_desktop_all-agents" "Orange_Is_the_New_Black_en.wikipedia.org_desktop_all-agents" "Howard_Hughes_en.wikipedia.org_all-access_spider" ...
$ X2015.07.01: int 75357 11952559 28486 137 17207 168 109821 20381245 65947 34167 ...
$ X2015.07.02: int 5396 12344021 26685 113 14756 132 13122 20752194 60189 7528 ...
Pages:
subject <- subject %>% arrange(Page)
package 'bindrcpp' was built under R version 3.3.3
subject$Page
[1] "Howard_Hughes_en.wikipedia.org_all-access_all-agents"
[2] "Howard_Hughes_en.wikipedia.org_all-access_spider"
[3] "Howard_Hughes_en.wikipedia.org_desktop_all-agents"
[4] "Howard_Hughes_en.wikipedia.org_mobile-web_all-agents"
[5] "Main_Page_en.wikipedia.org_all-access_all-agents"
[6] "Main_Page_en.wikipedia.org_all-access_spider"
[7] "Main_Page_en.wikipedia.org_desktop_all-agents"
[8] "Main_Page_en.wikipedia.org_mobile-web_all-agents"
[9] "Orange_Is_the_New_Black_en.wikipedia.org_all-access_all-agents"
[10] "Orange_Is_the_New_Black_en.wikipedia.org_all-access_spider"
[11] "Orange_Is_the_New_Black_en.wikipedia.org_desktop_all-agents"
[12] "Orange_Is_the_New_Black_en.wikipedia.org_mobile-web_all-agents"
3 components to Page:
Note that ‘all-access_all-agents’ is the total of the other variations.
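A quick sketch of splitting a Page value into parts with base R. The three-way split used here (article plus project, access type, agent type) is an assumption about what the components are; `rx` and `parts` are hypothetical names.

```r
pages <- c("Main_Page_en.wikipedia.org_all-access_all-agents",
           "Orange_Is_the_New_Black_en.wikipedia.org_mobile-web_all-agents",
           "Howard_Hughes_en.wikipedia.org_all-access_spider")

## the last two underscore-free tokens are access and agent;
## everything before them is the article name plus project
rx <- "^(.*)_([^_]+)_([^_]+)$"
parts <- data.frame(
  name   = sub(rx, "\\1", pages),
  access = sub(rx, "\\2", pages),
  agent  = sub(rx, "\\3", pages),
  stringsAsFactors = FALSE
)
parts
```

This works because the access and agent labels use hyphens, not underscores, while article names can contain any number of underscores.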
Try with one page variation.
main.all.all <- subject %>% filter(Page=="Main_Page_en.wikipedia.org_all-access_all-agents")
main.all.all.ts <- main.all.all %>% gather(key=date, value=views, -Page)
main.all.all.ts$date <- sub("X", "", main.all.all.ts$date)
main.all.all.ts$date <- as.Date(main.all.all.ts$date, format="%Y.%m.%d")
main.all.all.ts$index <- 1:nrow(main.all.all.ts) ## index for time series data points
summary(main.all.all.ts)
Page date views index
Length:550 Min. :2015-07-01 Min. :13658940 Min. : 1.0
Class :character 1st Qu.:2015-11-15 1st Qu.:18098508 1st Qu.:138.2
Mode :character Median :2016-03-31 Median :19457533 Median :275.5
Mean :2016-03-31 Mean :21938511 Mean :275.5
3rd Qu.:2016-08-15 3rd Qu.:22212934 3rd Qu.:412.8
Max. :2016-12-31 Max. :67264258 Max. :550.0
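For reference, the same wide-to-long reshaping can be sketched in base R on a toy row, which makes the date parsing step easy to check. The view counts are made up; `wide`, `date_cols` and `long` are hypothetical names.

```r
## one page, three date columns (made-up values)
wide <- data.frame(
  Page = "Main_Page_en.wikipedia.org_all-access_all-agents",
  X2015.07.01 = 100, X2015.07.02 = 120, X2015.07.03 = 90,
  stringsAsFactors = FALSE
)

date_cols <- setdiff(names(wide), "Page")
long <- data.frame(
  Page  = wide$Page,                                            # recycled for every row
  date  = as.Date(sub("^X", "", date_cols), format = "%Y.%m.%d"), # "X2015.07.01" -> 2015-07-01
  views = as.numeric(unlist(wide[1, date_cols])),
  stringsAsFactors = FALSE
)
long$index <- seq_len(nrow(long))  # index for time series data points
long
```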
chart.title <- "Daily Views for Main page - all access, all agents"
plot.ts1 <- ggplot(main.all.all.ts, aes(x=date, y=views))+geom_line()+
scale_y_continuous(labels=comma, expand=c(0,0))+theme_classic()+ggtitle(chart.title)
ggplotly(plot.ts1)
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
Take the Main page (all access, all agents) as the example for building a time series model from a single series.
References:
Info:
SAME CHART AS ABOVE WITH LOESS SMOOTHING ADDED (ggplot2 defaults)
plot.ts1+geom_smooth(method='loess')
chart.title <- "Same Plot with loess span set lower for more granularity"
plot.ts1+geom_smooth(method='loess', span=0.3)+ggtitle(chart.title)
plot.ts1+
geom_smooth(formula=y ~ x, method='loess', span=0.2, color='red', se=FALSE)+
geom_smooth(method='loess', span=0.4, color='orange', se=FALSE)+
geom_smooth(method='loess', span=0.6, color='green', se=FALSE)+
geom_smooth(method='loess', span=0.8, color='purple', se=FALSE)+
ggtitle("Same Plot with various smoothing lines (span adjusted, no conf. int.)")
Get loess model from existing data
loess1 <- loess(views ~ as.numeric(date), data=main.all.all.ts, span=0.3)
summary(loess1)
Call:
loess(formula = views ~ as.numeric(date), data = main.all.all.ts,
span = 0.3)
Number of Observations: 550
Equivalent Number of Parameters: 10.02
Residual Standard Error: 5885000
Trace of smoother matrix: 11.07 (exact)
Control settings:
span : 0.3
degree : 2
family : gaussian
surface : interpolate cell = 0.2
normalize: TRUE
parametric: FALSE
drop.square: FALSE
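To see what span does to the fit, a small sketch on synthetic data (not the page views). The "Equivalent Number of Parameters" in the summary is stored as `enp` on the fitted object; a smaller span gives a more flexible curve and a larger `enp`. `fit_tight` and `fit_loose` are hypothetical names.

```r
set.seed(1)
x <- 1:200
y <- 10 * sin(x / 20) + rnorm(200)   # synthetic wavy series with noise

fit_tight <- loess(y ~ x, span = 0.2)  # small neighbourhood: follows local wiggles
fit_loose <- loess(y ~ x, span = 0.8)  # large neighbourhood: smoother curve

## equivalent number of parameters for each fit
c(tight = fit_tight$enp, loose = fit_loose$enp)
```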
Apply Loess model to future time periods
## extend date range for prediction period
ndays <- 30 ## number of days to predict
pred.period <- data.frame(date=seq(min(main.all.all.ts$date), max(main.all.all.ts$date)+ndays, by='days'))
## prediction with loess1 doesn't work because default loess doesn't extrapolate
#predict(loess1, pred.period, se=TRUE)
## new loess model: add control=...
sp <- 0.45 ## set span for model: a lower span fits each point from a smaller neighbourhood, so the end of the curve follows recent data more closely
loess2 <- loess(views ~ as.numeric(date), data=main.all.all.ts, control=loess.control(surface = 'direct'), span=sp)
## plot actual data with fitted data from model
plot.ts1+geom_line(aes(date, loess2$fitted, color='model'))
## predict with new loess model - extended period
pr <- predict(loess2, as.numeric(pred.period$date), se=TRUE)
## prediction (including existing data)
#pr[[1]] ## first object is prediction
## new data frame with dates and prediction
prediction <- pred.period %>% mutate(views.pred=pr[[1]])
## join date range for prediction with existing data
main.pred <- left_join(prediction, main.all.all.ts, by='date') %>% select(-index)
## plot the result
chart.title <- "Daily Views for Main Page with Loess Model"
plot.ts2 <- ggplot(main.pred, aes(x=date, y=views))+geom_line()+
scale_y_continuous(labels=comma, expand=c(0,0))+theme_classic()+
ggtitle(chart.title)+geom_line(aes(date, views.pred, color='model+prediction'))
ggplotly(plot.ts2)
Span: 0.45 Number of days predicted: 30
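The extrapolation issue noted in the code comments can be reproduced on synthetic data. A minimal sketch: with the default `surface = 'interpolate'`, `predict()` returns NA outside the range of the fitted x values, while `surface = 'direct'` returns numbers. Names here are hypothetical.

```r
set.seed(42)
x <- 1:100
y <- 0.5 * x + rnorm(100)            # synthetic trend plus noise
new_x <- data.frame(x = 101:105)     # points beyond the observed range

fit_interp <- loess(y ~ x, span = 0.5)
fit_direct <- loess(y ~ x, span = 0.5,
                    control = loess.control(surface = "direct"))

pred_interp <- predict(fit_interp, new_x)  # all NA: default surface does not extrapolate
pred_direct <- predict(fit_direct, new_x)  # numeric values

pred_interp
pred_direct
```

Extrapolated loess values come from the local polynomial at the edge of the data, so they should be read with caution the further out they go.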
Reference:
* http://r-statistics.co/Time-Series-Analysis-With-R.html
## time series formulation examples from above reference
# ts (inputData, frequency = 4, start = c(1959, 2)) # frequency 4 => Quarterly Data
# ts (1:10, frequency = 12, start = 1990) # freq 12 => Monthly data.
# ts (inputData, start=c(2009), end=c(2014), frequency=1) # Yearly Data
## See Notes below for explanation of frequency
## start is c(year, position within the year); for frequency=365 the position is the day of the year
ts.Main <- ts(main.all.all.ts$views, frequency=365, start=c(year(min(main.all.all.ts$date)), yday(min(main.all.all.ts$date))))
## values are still daily, so frequency=52 defines a 52-observation cycle rather than calendar weeks
ts.Main.all.all.wk <- ts(main.all.all.ts$views, frequency=52, start=c(year(min(main.all.all.ts$date)), week(min(main.all.all.ts$date))))
Notes:
More on ‘time series has no or less than 2 periods’ error:
* https://stat.ethz.ch/pipermail/r-help/2013-October/361047.html
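A sketch of how the error arises, on synthetic data: `decompose()` needs at least two full periods, so 550 daily points with frequency=365 (730 needed) fails. The frequency=7 series is only an illustrative assumption about a weekly cycle in daily data, not the choice made in this analysis.

```r
set.seed(7)
n <- 550
views <- 100 + 10 * sin(2 * pi * (1:n) / 7) + rnorm(n)  # synthetic daily series

ts_yearly <- ts(views, frequency = 365, start = c(2015, 182))  # 2015-07-01 is day 182
ts_weekly <- ts(views, frequency = 7)                          # assumed weekly cycle

## fewer than 2 * 365 observations: decompose() stops with an error
res <- tryCatch(decompose(ts_yearly), error = function(e) conditionMessage(e))
res

## plenty of 7-day periods: decomposition succeeds
dec <- decompose(ts_weekly)
class(dec)
```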
Using ‘decompose’
decomposedRes <- decompose(ts.Main.all.all.wk, type='additive') ## type='mult' if multiplicative; 'additive' if additive
plot(decomposedRes)
Using ‘stl’
stl.style <- stl(ts.Main.all.all.wk, s.window='periodic')
plot(stl.style)
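The pieces shown in the `stl` plot can also be pulled out as columns. A small sketch on synthetic data, showing that seasonal, trend and remainder add back up to the original series; `x`, `fit`, `parts` and `recon` are hypothetical names.

```r
set.seed(3)
x <- ts(50 + 5 * sin(2 * pi * (1:84) / 7) + rnorm(84), frequency = 7)  # synthetic series

fit <- stl(x, s.window = "periodic")
parts <- fit$time.series            # columns: seasonal, trend, remainder
colnames(parts)

## additive decomposition: the three components sum to the data
recon <- parts[, "seasonal"] + parts[, "trend"] + parts[, "remainder"]
all.equal(as.numeric(recon), as.numeric(x))
```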
https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/
Reverting to the original daily series and proceeding with prediction.
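ts.Mainforecast <- HoltWinters(ts.Main, beta=FALSE, gamma=FALSE) ## simple exponential smoothing (call as reported in the output below)
ts.Mainforecast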
Holt-Winters exponential smoothing without trend and without seasonal component.
Call:
HoltWinters(x = ts.Main, beta = FALSE, gamma = FALSE)
Smoothing parameters:
alpha: 0.9999487
beta : FALSE
gamma: FALSE
Coefficients:
[,1]
a 26149449
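An alpha this close to 1 means the smoothed level is essentially the most recent observation. A minimal sketch of the level recursion on a toy vector (made-up numbers), checked against `HoltWinters()` with a fixed alpha; `ses_level` is a hypothetical helper, and the recursion assumes the level starts at the first observation.

```r
## simple exponential smoothing: level_t = alpha * x_t + (1 - alpha) * level_{t-1}
ses_level <- function(x, alpha) {
  level <- x[1]
  for (t in 2:length(x)) {
    level <- alpha * x[t] + (1 - alpha) * level
  }
  level
}

x <- c(10, 12, 11, 15, 14)   # toy series

## alpha near 1: the level sits almost exactly on the last value
ses_level(x, 0.9999487)

## same recursion with alpha fixed at 0.5, compared with HoltWinters()
hw <- HoltWinters(x, alpha = 0.5, beta = FALSE, gamma = FALSE)
c(manual = ses_level(x, 0.5), holtwinters = hw$coefficients[["a"]])

## with no trend or seasonal term, every forecast step equals the final level
fc <- predict(hw, n.ahead = 3)
fc
```

So a forecast from this model is a flat line at the last level, which with alpha near 1 is close to a "repeat the last day" forecast.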